ARIMAX
0.1 Vector autoregression (VAR)
I use a VAR model because it captures the linear dependencies among multiple time series. Its main assumption is that all variables are endogenous, meaning that each variable is influenced by the others, which I believe holds for my analysis.
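As a sketch of what this means, a first-order VAR with a constant and a linear trend (the form estimated in section 0.5) can be written as below. The symbols are my own notation and are assumptions for illustration: y_t stacks the five series (Breadth, XLF, XLK, ^VIX, SPY), c is the intercept vector, delta is the trend coefficient vector, A_1 is the 5 x 5 matrix of lag-1 coefficients, and epsilon_t is the error vector with covariance matrix Sigma.

```latex
\[
\mathbf{y}_t = \mathbf{c} + \boldsymbol{\delta}\, t + A_1 \mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t,
\qquad
\mathbb{E}[\boldsymbol{\varepsilon}_t] = \mathbf{0},
\quad
\operatorname{Var}(\boldsymbol{\varepsilon}_t) = \Sigma
\]
```

Each row of A_1 corresponds to one equation, so the off-diagonal entries are what allow one series to depend on the lagged values of another.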
0.2 Literature Review
This literature review aims to provide insights into the choice of the Vector Autoregression (VAR) model as a preferred method for predicting stock market time series data. We will discuss the advantages of the VAR model, its use in the existing literature, and its performance in comparison to other time series models. The review will also highlight the limitations and potential improvements for the VAR model in the context of stock market prediction.
The prediction of stock market behavior has been an area of interest for researchers, investors, and financial institutions. Several models have been proposed to analyze time series data in order to forecast stock market movements. Among them, the Vector Autoregression (VAR) model has emerged as a popular choice. VAR models have been widely used in economics and finance to analyze the dynamic relationships between multiple time series variables (Sims, 1980).
- Advantages of VAR model:
- Multivariate framework: Unlike univariate models like ARIMA, the VAR model incorporates the interdependencies between multiple variables, allowing for a more comprehensive understanding of the underlying relationships (Lütkepohl, 2005).
- Flexibility: The VAR model does not impose strong theoretical assumptions about the relationship between the variables, which makes it a flexible choice for diverse applications (Hamilton, 1994).
- Impulse response functions: The model allows for the analysis of impulse response functions, which can be helpful in understanding the dynamic responses of the system to shocks (Sims, 1980).
- Use of VAR model in existing literature:
- Stock market prediction: Various studies have employed VAR models for stock market predictions, such as Yüksel et al. (2015), who used a VAR model to predict stock prices in the Turkish stock market.
- Economic forecasting: The VAR model has been widely used in predicting macroeconomic variables, such as interest rates, inflation, and GDP (Stock & Watson, 2001; Bernanke, 2005).
- Limitations and potential improvements:
- High-dimensionality issues: Every variable added to a VAR model adds a full set of lag coefficients to every equation, so the number of parameters grows with the square of the number of variables. The resulting complexity and computational cost can lead to overfitting (Kilian & Lütkepohl, 2017).
- Nonlinear relationships: The VAR model assumes linearity in relationships between variables, which may not always hold true, limiting its applicability (Terasvirta, 1994).
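The impulse response analysis mentioned in the list above can be sketched with `irf()` from the `vars` package. This is a minimal illustration on simulated data, not on the series used in this report: the coefficient matrix `A`, the series names `a` and `b`, and the sample size are all made up for the example.

```r
library(vars)

set.seed(10)
n <- 300

# Hypothetical stable VAR(1): series "a" reacts to lagged "b", but not vice versa
A <- matrix(c(0.5, 0.2,
              0.0, 0.4), nrow = 2, byrow = TRUE)
y <- matrix(0, nrow = n, ncol = 2, dimnames = list(NULL, c("a", "b")))
for (t in 2:n) {
  y[t, ] <- A %*% y[t - 1, ] + rnorm(2)
}

fit <- VAR(y, p = 1, type = "const")

# Orthogonalized response of "a" to a shock in "b" over 10 periods,
# without bootstrap confidence bands to keep the example fast
ir <- irf(fit, impulse = "b", response = "a",
          n.ahead = 10, ortho = TRUE, boot = FALSE)
ir$irf$b
```

Setting `boot = TRUE` (the default) would add bootstrap confidence bands, at the cost of a longer run time.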
0.3 Data Processing
# Load the packages used below
library(tidyverse)  # read_csv(), dplyr verbs, %>%
library(lubridate)  # ymd()
library(tidyquant)  # tq_get()

# Read the market breadth data
spx_breadth_data <- read_csv("spx_market_breadth.csv") %>%
  mutate(Date = ymd(Date)) %>%
  rename(Breadth = `0`)

# Define the sector symbols, the VIX index, and SPY
symbols <- c("XLB", "XLC", "XLE", "XLF", "XLI", "XLK", "XLP", "XLRE", "XLU", "XLV", "XLY", "^VIX", "SPY")

# Initialize an empty data frame for the price data
price_data <- data.frame()

# Loop through the symbols
for (symbol in symbols) {
  # Get the adjusted price data and name the price column after the symbol
  tmp <- tq_get(symbol, from = "2020-04-07", to = Sys.Date(), get = "stock.prices", source = "yahoo") %>%
    dplyr::select(date, adjusted) %>%
    dplyr::rename(!!symbol := adjusted)

  # Merge the data into the price_data data frame
  if (nrow(price_data) == 0) {
    price_data <- tmp
  } else {
    price_data <- price_data %>%
      dplyr::full_join(tmp, by = "date")
  }
}
# A tibble: 6 × 15
Date Breadth XLB XLC XLE XLF XLI XLK XLP XLRE XLU XLV
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-04-21 NA 80.7 57.7 85.0 33.2 101. 148. 76.6 37.2 69.6 134.
2 2023-04-24 NA 81.2 57.6 86.3 33.1 101. 147. 76.9 37.1 70.0 135.
3 2023-04-25 NA 79.5 56.7 84.7 32.6 99.0 144. 76.8 36.7 69.9 134.
4 2023-04-26 NA 78.6 56.1 83.5 32.3 97.1 146. 76.2 36.4 68.3 132.
5 2023-04-27 NA 79.6 59.4 83.8 32.8 99.0 149. 77.0 37.3 69.1 132.
6 2023-04-28 NA 80.6 59.9 85.1 33.2 100 151. 77.4 37.8 69.0 134.
# … with 3 more variables: XLY <dbl>, `^VIX` <dbl>, SPY <dbl>
0.4 Variable selection
Because market breadth measures sentiment in the S&P 500 (SPX) stock market, it is reasonable to expect it to be correlated with “^VIX”, the Chicago Board Options Exchange’s CBOE Volatility Index, a popular measure of the stock market’s expectation of volatility based on S&P 500 index options.
XLK, XLF, XLI, and the other sector ETFs track the subsectors of the S&P 500, and they are important components of market breadth, which is calculated by summing the market breadth of each sector. I include XLK (the ETF that tracks information technology) and XLF (the Financial Select Sector SPDR Fund) because they are the two most prominent ETFs tracking S&P 500 subsectors, and together they hold stocks worth 40% of the total S&P 500.
I also include SPY in the model because it is the ETF that tracks the S&P 500 index, which I believe should be correlated with market breadth.
# A tibble: 6 × 6
Date Breadth XLF XLK `^VIX` SPY
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-03-30 796. 31.8 149. 19.0 404.
2 2023-03-31 986. 32.2 151. 18.7 409.
3 2023-04-03 1004. 32.2 151. 18.5 411.
4 2023-04-04 909. 31.9 150. 19 409.
5 2023-04-05 869. 31.9 148. 19.1 408.
6 2023-04-06 852. 32.0 149. 18.4 409.
0.5 Model Selection and Fitting
$selection
AIC(n) HQ(n) SC(n) FPE(n)
1 1 1 1
$criteria
1 2 3 4 5 6
AIC(n) 8.038050 8.040759 8.064650 8.084700 8.115804 8.148135
HQ(n) 8.121496 8.183809 8.267305 8.346960 8.437668 8.529604
SC(n) 8.254559 8.411916 8.590456 8.765156 8.950909 9.137889
FPE(n) 3096.578091 3105.011458 3180.168054 3244.723857 3347.477331 3457.837250
7 8 9 10
AIC(n) 8.157219 8.179779 8.219924 8.250792
HQ(n) 8.598291 8.680456 8.780206 8.870678
SC(n) 9.301621 9.478830 9.673625 9.859141
FPE(n) 3489.891058 3570.191896 3717.325965 3834.996314
All four selection criteria (AIC, HQ, SC, and FPE) are minimized at p = 1, so I fit a VAR(1). Because the criteria agree, there is only one candidate lag order and no need to compare models with different values of p.
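The `$selection` and `$criteria` output above has the shape returned by `VARselect()` in the `vars` package. The call that produced it is not shown in this report, so the following is only a sketch of how such a lag search could be run, using simulated data and assumed settings (`lag.max = 10` to match the ten columns above, and `type = "both"` to match the fitted model).

```r
library(vars)

set.seed(1)
n <- 300

# Simulated stand-in for the real series: a stable bivariate VAR(1)
A <- matrix(c(0.5, 0.1,
              0.2, 0.4), nrow = 2, byrow = TRUE)
y <- matrix(0, nrow = n, ncol = 2, dimnames = list(NULL, c("y1", "y2")))
for (t in 2:n) {
  y[t, ] <- A %*% y[t - 1, ] + rnorm(2)
}

# Compute AIC, HQ, SC, and FPE for lag orders 1 to 10
lag_choice <- VARselect(y, lag.max = 10, type = "both")
lag_choice$selection  # lag order preferred by each criterion
lag_choice$criteria   # criterion values for every candidate lag

# Fit the model with the lag order chosen by SC (the Schwarz criterion)
p_sc <- unname(lag_choice$selection["SC(n)"])
fit <- VAR(y, p = p_sc, type = "both")
```

When the criteria disagree, SC tends to pick the more parsimonious model, which is one reason to use it as the tie-breaker here.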
VAR Estimation Results:
=========================
Endogenous variables: Breadth, XLF, XLK, X.VIX, SPY
Deterministic variables: both
Sample size: 755
Log Likelihood: -8401.628
Roots of the characteristic polynomial:
0.9957 0.9733 0.9447 0.9232 0.9117
Call:
VAR(y = spx_selected[, c(2:6)], p = 1, type = "both")
Estimation results for equation Breadth:
========================================
Breadth = Breadth.l1 + XLF.l1 + XLK.l1 + X.VIX.l1 + SPY.l1 + const + trend
Estimate Std. Error t value Pr(>|t|)
Breadth.l1 0.913301 0.019362 47.171 <2e-16 ***
XLF.l1 5.935174 4.953754 1.198 0.2313
XLK.l1 2.657364 1.895438 1.402 0.1613
X.VIX.l1 -0.055651 1.188483 -0.047 0.9627
SPY.l1 -1.840574 1.242666 -1.481 0.1390
const 215.726640 114.718562 1.880 0.0604 .
trend 0.007133 0.026931 0.265 0.7912
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 110.7 on 748 degrees of freedom
Multiple R-Squared: 0.8248, Adjusted R-squared: 0.8234
F-statistic: 586.8 on 6 and 748 DF, p-value: < 2.2e-16
Estimation results for equation XLF:
====================================
XLF = Breadth.l1 + XLF.l1 + XLK.l1 + X.VIX.l1 + SPY.l1 + const + trend
Estimate Std. Error t value Pr(>|t|)
Breadth.l1 -3.801e-05 8.160e-05 -0.466 0.642
XLF.l1 9.980e-01 2.088e-02 47.801 <2e-16 ***
XLK.l1 1.704e-03 7.988e-03 0.213 0.831
X.VIX.l1 1.835e-03 5.009e-03 0.366 0.714
SPY.l1 -8.896e-04 5.237e-03 -0.170 0.865
const 1.913e-01 4.835e-01 0.396 0.693
trend -4.570e-05 1.135e-04 -0.403 0.687
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4666 on 748 degrees of freedom
Multiple R-Squared: 0.9929, Adjusted R-squared: 0.9929
F-statistic: 1.749e+04 on 6 and 748 DF, p-value: < 2.2e-16
Estimation results for equation XLK:
====================================
XLK = Breadth.l1 + XLF.l1 + XLK.l1 + X.VIX.l1 + SPY.l1 + const + trend
Estimate Std. Error t value Pr(>|t|)
Breadth.l1 -0.0002506 0.0003811 -0.657 0.5111
XLF.l1 0.1610367 0.0975100 1.651 0.0991 .
XLK.l1 1.0195648 0.0373099 27.327 <2e-16 ***
X.VIX.l1 -0.0008661 0.0233942 -0.037 0.9705
SPY.l1 -0.0286052 0.0244607 -1.169 0.2426
const 3.6083827 2.2581277 1.598 0.1105
trend -0.0002494 0.0005301 -0.470 0.6381
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.179 on 748 degrees of freedom
Multiple R-Squared: 0.9875, Adjusted R-squared: 0.9874
F-statistic: 9863 on 6 and 748 DF, p-value: < 2.2e-16
Estimation results for equation X.VIX:
======================================
X.VIX = Breadth.l1 + XLF.l1 + XLK.l1 + X.VIX.l1 + SPY.l1 + const + trend
Estimate Std. Error t value Pr(>|t|)
Breadth.l1 -1.611e-05 3.334e-04 -0.048 0.961
XLF.l1 -9.522e-02 8.529e-02 -1.116 0.265
XLK.l1 -1.026e-02 3.264e-02 -0.314 0.753
X.VIX.l1 9.101e-01 2.046e-02 44.474 <2e-16 ***
SPY.l1 1.048e-02 2.140e-02 0.490 0.625
const 2.361e+00 1.975e+00 1.196 0.232
trend 3.490e-04 4.637e-04 0.753 0.452
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.906 on 748 degrees of freedom
Multiple R-Squared: 0.8811, Adjusted R-squared: 0.8802
F-statistic: 923.9 on 6 and 748 DF, p-value: < 2.2e-16
Estimation results for equation SPY:
====================================
SPY = Breadth.l1 + XLF.l1 + XLK.l1 + X.VIX.l1 + SPY.l1 + const + trend
Estimate Std. Error t value Pr(>|t|)
Breadth.l1 -0.0002832 0.0008052 -0.352 0.7252
XLF.l1 0.3963196 0.2060121 1.924 0.0548 .
XLK.l1 0.1063642 0.0788257 1.349 0.1776
X.VIX.l1 0.0129448 0.0494255 0.262 0.7935
SPY.l1 0.9076019 0.0516788 17.562 <2e-16 ***
const 8.9520506 4.7708090 1.876 0.0610 .
trend -0.0004527 0.0011200 -0.404 0.6862
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.604 on 748 degrees of freedom
Multiple R-Squared: 0.991, Adjusted R-squared: 0.9909
F-statistic: 1.373e+04 on 6 and 748 DF, p-value: < 2.2e-16
Covariance matrix of residuals:
Breadth XLF XLK X.VIX SPY
Breadth 12257.85 38.8118 160.1307 -136.4712 410.918
XLF 38.81 0.2177 0.6022 -0.5563 1.711
XLK 160.13 0.6022 4.7495 -2.8434 9.334
X.VIX -136.47 -0.5563 -2.8434 3.6339 -6.626
SPY 410.92 1.7105 9.3337 -6.6260 21.200
Correlation matrix of residuals:
Breadth XLF XLK X.VIX SPY
Breadth 1.0000 0.7513 0.6637 -0.6466 0.8061
XLF 0.7513 1.0000 0.5922 -0.6254 0.7962
XLK 0.6637 0.5922 1.0000 -0.6844 0.9302
X.VIX -0.6466 -0.6254 -0.6844 1.0000 -0.7549
SPY 0.8061 0.7962 0.9302 -0.7549 1.0000
0.6 Model Forecast
# Packages used in this step
library(vars)    # predict() method for the fitted VAR
library(dplyr)
library(tidyr)
library(plotly)

# Forecast using the model
# (best_model is assumed to be the VAR(1) fit from section 0.5)
n_periods <- 30 # Specify the number of periods for the forecast
forecast <- predict(best_model, n.ahead = n_periods)

# Display the forecast results
# print(forecast$fcst)

# Convert the forecasts to a data frame
forecast_df <- data.frame(forecast$fcst)

# Build the forecast dates, starting the day after the last observation
# Note: by = 1 steps through calendar days, so weekends are included
original_data_end <- tail(spx_selected, n = 1)$Date
forecast_dates <- seq(from = original_data_end + 1, by = 1, length.out = n_periods)
forecast_df <- data.frame(Date = forecast_dates, forecast_df)

# Select only the point-forecast columns and restore the original names
forecast_df <- forecast_df %>%
  dplyr::select(Date, Breadth = Breadth.fcst, XLF = XLF.fcst, XLK = XLK.fcst, `^VIX` = X.VIX.fcst, SPY = SPY.fcst)

# Combine original data and forecasts
combined_data <- rbind(spx_selected, forecast_df)

# Convert combined_data to a long format
combined_data_long <- combined_data %>%
  tidyr::pivot_longer(cols = -Date, names_to = "Variable", values_to = "Value")

# Create an interactive plot with a range slider;
# the dashed red line marks where the forecast begins
p <- plot_ly(data = combined_data_long, x = ~Date, y = ~Value, color = ~Variable, type = "scatter", mode = "lines") %>%
  layout(title = "Original Data and Forecasts",
         xaxis = list(title = "Date",
                      rangeslider = list(visible = TRUE)),
         yaxis = list(title = "Values"),
         shapes = list(list(type = "line",
                            x0 = original_data_end,
                            x1 = original_data_end,
                            y0 = min(combined_data_long$Value, na.rm = TRUE),
                            y1 = max(combined_data_long$Value, na.rm = TRUE),
                            yref = "y",
                            xref = "x",
                            line = list(color = "red", dash = "dash"))),
         annotations = list(list(x = original_data_end,
                                 y = max(combined_data_long$Value, na.rm = TRUE),
                                 xref = "x",
                                 yref = "y",
                                 text = "Forecast",
                                 showarrow = FALSE,
                                 ax = 0,
                                 ay = -15,
                                 font = list(size = 12, color = "red"))))

# Show the plot
p
0.7 Findings and conclusion
From the VAR estimation results, several observations can be made about the relationships between the endogenous variables (Breadth, XLF, XLK, X.VIX, and SPY) and their lagged values:
Breadth: The lagged value of Breadth (Breadth.l1) is highly significant (p-value < 0.001), indicating a strong positive relationship with the current value of Breadth.
XLF: The lagged value of XLF (XLF.l1) is highly significant (p-value < 0.001), indicating a strong positive relationship with the current value of XLF.
XLK: The lagged value of XLK (XLK.l1) is highly significant (p-value < 0.001), indicating a strong positive relationship with the current value of XLK.
X.VIX: The lagged value of X.VIX (X.VIX.l1) is highly significant (p-value < 0.001), indicating a strong positive relationship with the current value of X.VIX.
SPY: The lagged value of SPY (SPY.l1) is highly significant (p-value < 0.001), indicating a strong positive relationship with the current value of SPY.
Overall, the model suggests that each variable’s own lagged value is an important predictor of its current value. However, none of the cross-variable lag coefficients is significant at the 5% level. For example, the lagged value of XLF in the Breadth equation is not significant (p-value = 0.2313), which suggests that the lagged value of XLF does not have a strong relationship with the current value of Breadth. The closest cases are the lagged XLF terms in the SPY equation (p-value = 0.0548) and the XLK equation (p-value = 0.0991), which are significant only at the 10% level.
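The p-values discussed above can be read programmatically from a fitted `vars` model, because its `varresult` element is a list of ordinary `lm` fits, one per equation. The sketch below uses two simulated, independent AR(1) series with made-up names `a` and `b`, so the own-lag term should be strongly significant while the cross-lag term has no true effect.

```r
library(vars)

set.seed(2)
n <- 300

# Two hypothetical, independent AR(1) series
y <- data.frame(
  a = as.numeric(arima.sim(list(ar = 0.9), n = n)),
  b = as.numeric(arima.sim(list(ar = 0.9), n = n))
)

fit <- VAR(y, p = 1, type = "both")

# Coefficient table for the "a" equation: estimate, std. error, t value, p-value
coef_a <- summary(fit$varresult$a)$coefficients
coef_a

# p-values only: own lag (a.l1) versus cross lag (b.l1)
coef_a[c("a.l1", "b.l1"), "Pr(>|t|)"]
```

The same extraction applied to every element of `fit$varresult` would give a compact table of own-lag and cross-lag p-values for all equations.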
Additionally, the adjusted R-squared values for all the equations are relatively high, which indicates that the model explains a large proportion of the variation in the data. The highest adjusted R-squared value is for the XLF equation (0.9929), while the lowest is for the Breadth equation (0.8234).
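The residuals’ covariance and correlation matrices show that the residuals are correlated across equations. For example, the residual correlation between SPY and XLK is 0.93, and the X.VIX residuals are negatively correlated with those of every other series. This correlation indicates that there might be some underlying common factors that affect all the endogenous variables.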
In conclusion, the VAR model provides insights into the relationships between the selected endogenous variables and their lagged values. It is essential to take into account these relationships when making forecasts or interpreting the results. However, it’s important to note that the VAR model is a linear model, and there could be nonlinear relationships or other factors not captured by the model. Moreover, it would be useful to validate the model’s performance with out-of-sample data and consider alternative modeling approaches if necessary.
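One simple way to carry out the out-of-sample validation suggested above is a holdout test: fit the model on all but the last h observations, forecast h steps ahead, and compare the forecasts with the held-out values. The sketch below runs on simulated data, and the series names `a` and `b`, the holdout length `h`, and the use of RMSE as the error measure are all assumptions for illustration.

```r
library(vars)

set.seed(4)
n <- 320  # total number of observations
h <- 20   # size of the holdout window

# Hypothetical data: two simulated AR(1) series
y <- data.frame(
  a = as.numeric(arima.sim(list(ar = 0.8), n = n)),
  b = as.numeric(arima.sim(list(ar = 0.5), n = n))
)

# Split into a training sample and a holdout sample
train <- y[1:(n - h), ]
test  <- y[(n - h + 1):n, ]

# Fit on the training sample only and forecast the holdout window
fit  <- VAR(train, p = 1, type = "both")
pred <- predict(fit, n.ahead = h)

# Collect the point forecasts into an h x 2 matrix
fc <- sapply(pred$fcst, function(m) m[, "fcst"])

# Root mean squared forecast error for each series
rmse <- sqrt(colMeans((as.matrix(test) - fc)^2))
rmse
```

A rolling or expanding window would give a more reliable picture than a single split, at the cost of refitting the model many times.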